This report explores the relationships between various factors influencing student performance, using exploratory data analysis (EDA) to identify key trends and correlations. The analysis focuses on variables such as study habits, access to resources, parental involvement, and environmental factors, and how they impact final exam scores. Insights gained from the data will inform recommendations aimed at improving academic outcomes for students.
The dataset was sourced from Kaggle under the CC0 1.0 Universal “No Copyright” license. We are free to copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. Learn more about this license here.
URL for data in Kaggle: Student Performance Factors Dataset
student_data <- read.csv('../data/StudentPerformanceFactors.csv', header = TRUE)
student_data # Display the dataset
 Hours_Studied     Attendance     Parental_Involvement Access_to_Resources
 Min.   : 1.00   Min.   : 60.00   Length:6607          Length:6607
 1st Qu.:16.00   1st Qu.: 70.00   Class :character     Class :character
 Median :20.00   Median : 80.00   Mode  :character     Mode  :character
 Mean   :19.98   Mean   : 79.98
 3rd Qu.:24.00   3rd Qu.: 90.00
 Max.   :44.00   Max.   :100.00
 Extracurricular_Activities  Sleep_Hours     Previous_Scores
 Length:6607                Min.   : 4.000   Min.   : 50.00
 Class :character           1st Qu.: 6.000   1st Qu.: 63.00
 Mode  :character           Median : 7.000   Median : 75.00
                            Mean   : 7.029   Mean   : 75.07
                            3rd Qu.: 8.000   3rd Qu.: 88.00
                            Max.   :10.000   Max.   :100.00
 Motivation_Level   Internet_Access    Tutoring_Sessions Family_Income
 Length:6607        Length:6607        Min.   :0.000     Length:6607
 Class :character   Class :character   1st Qu.:1.000     Class :character
 Mode  :character   Mode  :character   Median :1.000     Mode  :character
                                       Mean   :1.494
                                       3rd Qu.:2.000
                                       Max.   :8.000
 Teacher_Quality    School_Type        Peer_Influence     Physical_Activity
 Length:6607        Length:6607        Length:6607        Min.   :0.000
 Class :character   Class :character   Class :character   1st Qu.:2.000
 Mode  :character   Mode  :character   Mode  :character   Median :3.000
                                                          Mean   :2.968
                                                          3rd Qu.:4.000
                                                          Max.   :6.000
 Learning_Disabilities Parental_Education_Level Distance_from_Home
 Length:6607           Length:6607              Length:6607
 Class :character      Class :character         Class :character
 Mode  :character      Mode  :character         Mode  :character
    Gender            Exam_Score
 Length:6607        Min.   : 55.00
 Class :character   1st Qu.: 65.00
 Mode  :character   Median : 67.00
                    Mean   : 67.24
                    3rd Qu.: 69.00
                    Max.   :101.00
'data.frame': 6607 obs. of 20 variables:
$ Hours_Studied : int 23 19 24 29 19 19 29 25 17 23 ...
$ Attendance : int 84 64 98 89 92 88 84 78 94 98 ...
$ Parental_Involvement : chr "Low" "Low" "Medium" "Low" ...
$ Access_to_Resources : chr "High" "Medium" "Medium" "Medium" ...
$ Extracurricular_Activities: chr "No" "No" "Yes" "Yes" ...
$ Sleep_Hours : int 7 8 7 8 6 8 7 6 6 8 ...
$ Previous_Scores : int 73 59 91 98 65 89 68 50 80 71 ...
$ Motivation_Level : chr "Low" "Low" "Medium" "Medium" ...
$ Internet_Access : chr "Yes" "Yes" "Yes" "Yes" ...
$ Tutoring_Sessions : int 0 2 2 1 3 3 1 1 0 0 ...
$ Family_Income : chr "Low" "Medium" "Medium" "Medium" ...
$ Teacher_Quality : chr "Medium" "Medium" "Medium" "Medium" ...
$ School_Type : chr "Public" "Public" "Public" "Public" ...
$ Peer_Influence : chr "Positive" "Negative" "Neutral" "Negative" ...
$ Physical_Activity : int 3 4 4 4 4 3 2 2 1 5 ...
$ Learning_Disabilities : chr "No" "No" "No" "No" ...
$ Parental_Education_Level : chr "High School" "College" "Postgraduate" "High School" ...
$ Distance_from_Home : chr "Near" "Moderate" "Near" "Moderate" ...
$ Gender : chr "Male" "Female" "Male" "Male" ...
$ Exam_Score : int 67 61 74 71 70 71 67 66 69 72 ...
[1] 0
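The output above has the shape produced by `summary()` and `str()`, and the trailing `[1] 0` looks like a missing-value count. The calls themselves are not shown in the report, so the following is only a minimal sketch of how such an inspection might be run. The `toy_data` frame and its values are made up for illustration and are not part of the dataset.

```r
# Toy data frame standing in for the real CSV (values are made up for illustration).
toy_data <- data.frame(
  Hours_Studied = c(23L, 19L, 24L, NA),
  Parental_Involvement = c("Low", "Low", "Medium", "High"),
  Exam_Score = c(67L, 61L, 74L, 71L),
  stringsAsFactors = FALSE
)

summary(toy_data)   # per-column summaries (quartiles and mean for numeric columns)
str(toy_data)       # column types and the first few values

# Count missing values overall and per column.
total_missing <- sum(is.na(toy_data))
missing_by_column <- colSums(is.na(toy_data))
print(total_missing)
print(missing_by_column)
```

Note that `is.na()` only counts true `NA` values; empty strings in character columns would need a separate check.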
Here we explore the distribution of final exam scores among students without considering other factors. To find out the distribution of final exam scores, we first sample the data and plot a histogram.
# Sample the data
set.seed(123)
exam_score_sample <- student_data$Exam_Score[sample(nrow(student_data), 100)]
exam_score_sample
 [1] 68 66 74 63 71 67 62 74 67 63 64 72 59 63 63 69 68 63 71 70 66 61 68 71 64
[26] 64 69 73 70 65 72 65 64 70 69 69 62 68 68 71 69 73 73 66 65 65 69 67 68 69
[51] 67 65 66 61 66 68 70 66 67 68 70 72 75 67 65 69 72 68 66 66 67 64 65 72 67
[76] 71 69 59 69 69 66 68 63 67 62 63 72 65 65 63 58 67 65 69 71 66 67 65 61 71
# Plot histogram
hist(exam_score_sample, main = "Distribution of Final Exam Scores", xlab = "Final Exam Score", col = "skyblue", border = "black")
From the histogram, we can see that the distribution of final exam scores is approximately normal. We now plot a boxplot to visualize the spread of scores and identify any outliers.
# Boxplot of final exam scores
boxplot(exam_score_sample, main = "Boxplot of Final Exam Scores", col = "skyblue", border = "black")
The boxplot shows that the distribution of final exam scores is centered around the median, with a few outliers on the lower end of the scale. Now we use numerical methods to check the normality of the distribution.
Shapiro-Wilk normality test
data: exam_score_sample
W = 0.98776, p-value = 0.4903
The Shapiro-Wilk test does not reject the null hypothesis of normality (p-value = 0.4903, greater than 0.05), which is consistent with final exam scores being approximately normally distributed.
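The call that produced the test output is not shown, so here is a minimal sketch of how a Shapiro-Wilk test is typically run and read in base R. It uses simulated scores rather than `exam_score_sample`. The names `simulated_scores`, `alpha`, and `reject_normality` are assumptions introduced for illustration.

```r
set.seed(123)
# Simulated scores standing in for exam_score_sample (assumption: not the real data).
simulated_scores <- rnorm(100, mean = 67, sd = 4)

normality_test <- shapiro.test(simulated_scores)
print(normality_test)

# The null hypothesis is that the sample comes from a normal distribution.
# A p-value above alpha means we fail to reject it at that level.
alpha <- 0.05
reject_normality <- normality_test$p.value < alpha
print(reject_normality)
```

Failing to reject the null hypothesis does not prove normality. It only means the sample gives no strong evidence against it.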
Next, we explore the distribution of final exam scores based on parental involvement levels. We will create histograms to compare the scores of students with different levels of parental involvement.
# Sample the data
high_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Exam_Score[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of final exam scores by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Final Exam Score", col = "skyblue", border = "black")
The three histograms show the distribution of final exam scores for students with high, medium, and low levels of parental involvement. We can see that the distribution of the scores seems to be similar across all three categories. They seem to follow a normal distribution, with a slight skew towards higher scores for students with high parental involvement. We now use numerical methods to check the normality of the distributions.
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.9794, p-value = 0.1194
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.98125, p-value = 0.1662
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.96994, p-value = 0.02186
The Shapiro-Wilk tests do not reject normality for students with high and medium parental involvement (p-values of 0.1194 and 0.1662, both greater than 0.05), so these distributions can be treated as approximately normal. For students with low parental involvement, the p-value of 0.02186 is below 0.05, so normality is rejected at the 5% level.
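The per-group sampling and testing can also be written once and applied to every level. The sketch below is one possible arrangement using `split()`, `lapply()`, and `sapply()`. It runs on a synthetic `toy_data` frame (an assumption, not the real dataset), so its p-values say nothing about the actual students.

```r
set.seed(123)
# Simulated stand-in for student_data (assumption: values are synthetic).
toy_data <- data.frame(
  Parental_Involvement = rep(c("High", "Medium", "Low"), each = 300),
  Exam_Score = round(rnorm(900, mean = 67, sd = 4))
)

# Split scores by level, then draw 100 scores from each group.
scores_by_level <- split(toy_data$Exam_Score, toy_data$Parental_Involvement)
samples <- lapply(scores_by_level, function(scores) sample(scores, 100))

# Run the Shapiro-Wilk test on every sample and collect the p-values.
p_values <- sapply(samples, function(s) shapiro.test(s)$p.value)
print(round(p_values, 4))
```

Keeping the samples in a named list makes it harder to mix up groups than maintaining one variable per level.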
We now investigate the distribution of final exam scores for students with low parental involvement further, starting with a density plot to visualize its shape more clearly.
# Density plot of final exam scores for students with low parental involvement
plot(density(low_parental_involvement), main = "Density Plot of Final Exam Scores for Low Parental Involvement", xlab = "Final Exam Score", col = "skyblue")
The density plot shows that the distribution of final exam scores is slightly skewed to the left for students with low parental involvement. We will now create a QQ plot to compare the distribution of scores to a normal distribution.
# QQ plot of final exam scores for students with low parental involvement
qqnorm(low_parental_involvement, main = "QQ Plot of Final Exam Scores for Low Parental Involvement", col = "skyblue")
qqline(low_parental_involvement, col = "red")
The QQ plot confirms that the distribution of final exam scores for students with low parental involvement is slightly skewed to the left, deviating from a normal distribution.
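A visual reading of skew can be backed by a number. The sketch below computes the moment-based sample skewness in base R. The helper `sample_skewness()` and both score vectors are hypothetical and are not taken from the report or the dataset.

```r
# Moment-based sample skewness: negative means a longer left tail,
# positive means a longer right tail, and values near 0 suggest symmetry.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical scores with a few low values pulling the left tail out.
left_skewed_scores <- c(55, 58, 64, 66, 67, 67, 68, 68, 69, 70)
symmetric_scores <- c(60, 65, 70, 75, 80)

print(sample_skewness(left_skewed_scores))
print(sample_skewness(symmetric_scores))
```

Applied to `low_parental_involvement`, a negative value would agree with the left skew seen in the density and QQ plots.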
Next, we explore the distribution of final exam scores based on access to resources. We will create histograms and density plots to compare the scores of students with different levels of access to resources.
# Sample the data
high_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "High"][sample(sum(student_data$Access_to_Resources == "High"), 100)]
medium_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Medium"][sample(sum(student_data$Access_to_Resources == "Medium"), 100)]
low_access_to_resources <- student_data$Exam_Score[student_data$Access_to_Resources == "Low"][sample(sum(student_data$Access_to_Resources == "Low"), 100)]
# Histogram of final exam scores by access to resources
par(mfrow = c(1, 3))
hist(high_access_to_resources, main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_access_to_resources, main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_access_to_resources, main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue", border = "black")
plot(density(low_access_to_resources), main = "Low Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(medium_access_to_resources), main = "Medium Access to Resources", xlab = "Final Exam Score", col = "skyblue")
plot(density(high_access_to_resources), main = "High Access to Resources", xlab = "Final Exam Score", col = "skyblue")
The histograms and density plots show the distribution of final exam scores for students with high, medium, and low levels of access to resources. The distributions seem to be similar across all three categories, with a slight skew towards higher scores for students with high access to resources. We will now use numerical methods to check the normality of the distributions.
Shapiro-Wilk normality test
data: high_access_to_resources
W = 0.85758, p-value = 2.283e-08
Shapiro-Wilk normality test
data: medium_access_to_resources
W = 0.73995, p-value = 5.121e-12
Shapiro-Wilk normality test
data: low_access_to_resources
W = 0.77633, p-value = 4.942e-11
The Shapiro-Wilk tests reject normality for all three categories of access to resources, with p-values far below 0.05. This indicates that the sampled final exam scores deviate from a normal distribution at every level of access to resources.
Next, we explore the distribution of final exam scores based on participation in extracurricular activities. We will create a boxplot to compare the scores of students who participate in extracurricular activities and those who do not.
# Sample the data
participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "Yes"][sample(sum(student_data$Extracurricular_Activities == "Yes"), 100)]
do_not_participate_extracurricular <- student_data$Exam_Score[student_data$Extracurricular_Activities == "No"][sample(sum(student_data$Extracurricular_Activities == "No"), 100)]
# Boxplot of final exam scores by extracurricular activities
boxplot(student_data$Exam_Score ~ student_data$Extracurricular_Activities, main = "Final Exam Scores by Extracurricular Activities", xlab = "Extracurricular Activities", ylab = "Final Exam Score", col = "skyblue", border = "black")
The boxplot shows that students who participate in extracurricular activities tend to have higher final exam scores compared to those who do not. Now we will visualize the distribution of scores for both groups using histograms.
# Histogram of final exam scores by extracurricular activities
par(mfrow = c(1, 2))
hist(participate_extracurricular, main = "Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(do_not_participate_extracurricular, main = "No Extracurricular Activities", xlab = "Final Exam Score", col = "skyblue", border = "black")
Both histograms show right-skewed distributions, with students who participate in extracurricular activities having higher final exam scores. We will now use numerical methods to check the normality of the distributions with the following hypothesis test.
Shapiro-Wilk normality test
data: participate_extracurricular
W = 0.88155, p-value = 2.112e-07
The Shapiro-Wilk test indicates that the distribution of final exam scores for students who participate in extracurricular activities is not normal, with a p-value less than 0.05. Thus, we reject the null hypothesis of normality.
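The earlier boxplot comparison of the two groups can be complemented with per-group summary statistics. The sketch below uses `tapply()` on a small hypothetical data frame. The scores are invented for illustration and do not come from the dataset.

```r
# Hypothetical scores for illustration; the real analysis uses student_data.
toy_data <- data.frame(
  Extracurricular_Activities = c("Yes", "Yes", "Yes", "No", "No", "No"),
  Exam_Score = c(70, 68, 72, 66, 67, 65)
)

# Median and mean exam score for each group.
group_medians <- tapply(toy_data$Exam_Score, toy_data$Extracurricular_Activities, median)
group_means <- tapply(toy_data$Exam_Score, toy_data$Extracurricular_Activities, mean)
print(group_medians)
print(group_means)
```

Because the scores here are not normally distributed, the median is arguably the safer summary to compare.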
Next we explore the distribution of final exam scores based on motivation levels. We will create a histogram to compare the scores of students with different motivation levels.
# Sample the data (filter and sample within the same motivation level)
high_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "High"][sample(sum(student_data$Motivation_Level == "High"), 100)]
medium_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Medium"][sample(sum(student_data$Motivation_Level == "Medium"), 100)]
low_motivation <- student_data$Exam_Score[student_data$Motivation_Level == "Low"][sample(sum(student_data$Motivation_Level == "Low"), 100)]
# Histogram of final exam scores by motivation level
par(mfrow = c(1, 3))
hist(high_motivation, main = "High Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(medium_motivation, main = "Medium Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
hist(low_motivation, main = "Low Motivation Level", xlab = "Final Exam Score", col = "skyblue", border = "black")
The histograms for high and medium motivation levels show a normal distribution of final exam scores, while the low motivation level histogram shows a right-skewed distribution. We will now use numerical methods to check the normality of the distributions.
Shapiro-Wilk normality test
data: high_motivation
W = 0.98236, p-value = 0.202
Shapiro-Wilk normality test
data: medium_motivation
W = 0.9695, p-value = 0.4479
Shapiro-Wilk normality test
data: low_motivation
W = 0.73846, p-value = 2.385e-09
The Shapiro-Wilk tests do not reject normality for students with high and medium motivation levels (p-values of 0.202 and 0.4479, both greater than 0.05). For students with low motivation levels, the p-value of 2.385e-09 is far below 0.05, so the distribution deviates clearly from normality.
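Each sample in this report is built by filtering on a level and then indexing with `sample()` over that level's count. If the two conditions ever name different levels, the index can run past the end of the filtered vector and silently produce `NA` values. A small helper can keep the two in sync. The function `sample_scores_by_level()` below is a hypothetical sketch, and `toy_data` is synthetic.

```r
# Hypothetical helper: filter on one level and sample within that same group,
# so the filter and the sample size can never refer to different levels.
sample_scores_by_level <- function(data, group_column, level, n = 100) {
  group_scores <- data$Exam_Score[data[[group_column]] == level]
  if (length(group_scores) < n) {
    stop("Group '", level, "' has fewer than ", n, " observations.")
  }
  group_scores[sample(length(group_scores), n)]
}

set.seed(123)
# Synthetic stand-in for student_data (assumption: values are made up).
toy_data <- data.frame(
  Motivation_Level = rep(c("High", "Medium", "Low"), times = c(150, 300, 200)),
  Exam_Score = round(rnorm(650, mean = 67, sd = 4))
)

low_motivation_sample <- sample_scores_by_level(toy_data, "Motivation_Level", "Low")
print(length(low_motivation_sample))
print(anyNA(low_motivation_sample))
```

The explicit size check turns an oversized request into an error instead of a sample padded with missing values.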
In summary, the distribution of final exam scores is approximately normal when considering all students. However, when examining the scores based on other factors such as parental involvement, access to resources, extracurricular activities, and motivation levels, the distributions vary. Students with high parental involvement and high access to resources tend to have higher final exam scores, while students who participate in extracurricular activities also perform better. Motivation levels also play a role in student performance, with students who are highly motivated achieving higher scores.
Next, we explore the average number of hours students study per week without considering other factors.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   16.00   20.00   19.98   24.00   44.00
The summary statistics show that the average number of hours students study per week is approximately 19.98 hours. We will now visualize the distribution of study hours using a histogram.
# Histogram of study hours
hist(student_data$Hours_Studied, main = "Distribution of Study Hours", xlab = "Study Hours", col = "skyblue", border = "black")
Next, we will explore the average number of hours students study per week based on parental involvement levels.
# Summary statistics for study hours by parental involvement
summary(student_data$Hours_Studied[student_data$Parental_Involvement == "High"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   16.00   20.00   19.85   24.00   44.00
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00   16.00   20.00   19.98   24.00   39.00
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   16.00   20.00   20.13   24.00   38.00
The summary statistics for weekly study hours are very similar across parental involvement levels, with group means between 19.85 and 20.13 hours. We will now use ANOVA to test formally for differences in study hours based on parental involvement levels. But first, we need to check the assumptions of ANOVA.
# Sample the data
high_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "High"][sample(sum(student_data$Parental_Involvement == "High"), 100)]
medium_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Medium"][sample(sum(student_data$Parental_Involvement == "Medium"), 100)]
low_parental_involvement <- student_data$Hours_Studied[student_data$Parental_Involvement == "Low"][sample(sum(student_data$Parental_Involvement == "Low"), 100)]
# Histogram of study hours by parental involvement
par(mfrow = c(1, 3))
hist(high_parental_involvement, main = "High Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(medium_parental_involvement, main = "Medium Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
hist(low_parental_involvement, main = "Low Parental Involvement", xlab = "Study Hours", col = "skyblue", border = "black")
Shapiro-Wilk normality test
data: high_parental_involvement
W = 0.98554, p-value = 0.3473
Shapiro-Wilk normality test
data: medium_parental_involvement
W = 0.97749, p-value = 0.08461
Shapiro-Wilk normality test
data: low_parental_involvement
W = 0.98649, p-value = 0.4041
Both the histograms and the Shapiro-Wilk tests (all p-values greater than 0.05) suggest that the distribution of study hours is approximately normal within each parental involvement group. We will now use ANOVA to test for differences in study hours based on parental involvement levels.
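Besides normality, one-way ANOVA also assumes roughly equal variances across groups. The report does not show a check for this, so the sketch below is only a suggestion: it uses Bartlett's test (`bartlett.test()` from base R's stats package). It runs on simulated data, and `toy_data` and its values are assumptions.

```r
set.seed(123)
# Synthetic stand-in for student_data (assumption: values are simulated).
toy_data <- data.frame(
  Parental_Involvement = rep(c("High", "Medium", "Low"), each = 100),
  Hours_Studied = rnorm(300, mean = 20, sd = 6)
)

# Bartlett's test: the null hypothesis is that all groups share one variance.
variance_test <- bartlett.test(Hours_Studied ~ Parental_Involvement, data = toy_data)
print(variance_test)

# Per-group variances, for a direct look at the spread.
print(tapply(toy_data$Hours_Studied, toy_data$Parental_Involvement, var))
```

A p-value above 0.05 would give no evidence against equal variances. Bartlett's test is sensitive to non-normality, so it fits best when the normality checks above have passed.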
# ANOVA test for study hours by parental involvement
anova_parental_involvement <- aov(Hours_Studied ~ Parental_Involvement, data = student_data)
summary(anova_parental_involvement)
                       Df Sum Sq Mean Sq F value Pr(>F)
Parental_Involvement    2     62   30.86    0.86  0.423
Residuals            6604 237009   35.89
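In the table above, the F value is the ratio of the two mean squares (30.86 / 35.89 ≈ 0.86), and Pr(>F) = 0.423 is well above 0.05. The test therefore gives no evidence that mean study hours differ across parental involvement levels. The sketch below shows how these quantities might be pulled out of an `aov` fit programmatically. It uses simulated data, and `toy_data`, `f_by_hand`, and the other names are assumptions.

```r
set.seed(123)
# Synthetic stand-in for student_data (assumption: values are simulated).
toy_data <- data.frame(
  Parental_Involvement = rep(c("High", "Medium", "Low"), each = 100),
  Hours_Studied = rnorm(300, mean = 20, sd = 6)
)

anova_fit <- aov(Hours_Studied ~ Parental_Involvement, data = toy_data)
anova_table <- summary(anova_fit)[[1]]

# The F value is the between-group mean square divided by the residual mean square.
mean_squares <- anova_table[["Mean Sq"]]
f_by_hand <- mean_squares[1] / mean_squares[2]
f_reported <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

print(c(f_by_hand = f_by_hand, f_reported = f_reported, p_value = p_value))
```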